Practical Data-Dependent Metric Compression with Provable Guarantees
Authors
Abstract
How well can one compress a dataset of points from a high-dimensional space while preserving pairwise distances? Indyk and Wagner have recently obtained almost optimal bounds for this problem, but their construction (based on hierarchical clustering) is not practical. In this talk, I will show a new, practical, quadtree-based compression scheme whose provable performance essentially matches that of the Indyk and Wagner result. In addition to the theoretical results, we will see an experimental comparison of the new scheme with Product Quantization (PQ), one of the most popular heuristics for distance-preserving compression, on several datasets. Unlike PQ and other heuristics that rely on the clusterability of the dataset, the new algorithm turns out to be more robust. The talk is based on joint work with Piotr Indyk and Tal Wagner.
Organizer(s): Rutgers/DIMACS Theory of Computing
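To make the notion of distance-preserving compression concrete, here is a toy sketch. It is not the scheme from the talk, nor that of Indyk and Wagner: it simply snaps every coordinate to the center of its grid cell at a fixed depth (the leaf cell of a uniform quadtree-style subdivision of the unit cube) and measures the worst pairwise distance error. The function names, the unit-cube assumption, and the random data are all illustrative assumptions.

```python
import itertools
import math
import random


def snap_to_cell(point, depth, side=1.0):
    """Replace each coordinate by the center of its grid cell at the given depth.

    Assumes every coordinate lies in [0, side]. Storing the cell index
    costs `depth` bits per coordinate.
    """
    cells = 2 ** depth
    width = side / cells
    snapped = []
    for x in point:
        idx = min(int(x / width), cells - 1)  # clamp x == side into the last cell
        snapped.append((idx + 0.5) * width)
    return tuple(snapped)


def max_distance_error(points, depth, side=1.0):
    """Largest absolute change in any pairwise Euclidean distance after snapping."""
    compressed = [snap_to_cell(p, depth, side) for p in points]
    worst = 0.0
    for i, j in itertools.combinations(range(len(points)), 2):
        before = math.dist(points[i], points[j])
        after = math.dist(compressed[i], compressed[j])
        worst = max(worst, abs(before - after))
    return worst


random.seed(0)
dim = 8
points = [tuple(random.random() for _ in range(dim)) for _ in range(50)]

for depth in (2, 4, 6):
    # Each point moves by at most sqrt(dim) * width / 2, so by the triangle
    # inequality any pairwise distance changes by at most sqrt(dim) * width.
    bound = math.sqrt(dim) / 2 ** depth
    print(f"bits/point={depth * dim:3d}  "
          f"max error={max_distance_error(points, depth):.4f}  bound={bound:.4f}")
```

This baseline spends the same number of bits on every point regardless of the data. The schemes discussed in the abstract are data-dependent, which is how they can do better; the sketch only illustrates the quantity being traded off (bits per point against distance distortion).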
Similar resources
A Practical Algorithm for Topic Modeling with Provable Guarantees
Topic models provide a useful method for dimensionality reduction and exploratory data analysis in large text corpora. Most approaches to topic model learning have been based on a maximum likelihood objective. Efficient algorithms exist that attempt to approximate this objective, but they have no provable guarantees. Recently, algorithms have been introduced that provide provable bounds, but th...
Cluster-Aware Compression with Provable K-means Preservation
This work rigorously explores the design of cluster-preserving compression schemes for high-dimensional data. We focus on the K-means algorithm and identify conditions under which running the algorithm on the compressed data yields the same clustering outcome as on the original. The compression is performed using single and multi-bit minimum mean square error quantization schemes as well as a gi...
Error-Resilient Optimal Data Compression
The problem of communication and computation in the presence of errors is difficult, and general solutions can be time consuming and inflexible (particularly when implemented with a prescribed error detection/correction). A reasonable approach is to investigate reliable communication in carefully selected areas of fundamental interest where specific solutions may be more practical than general ...
On Computing Compression Trees for Data Collection in Sensor Networks
We address the problem of efficiently gathering correlated data from a wired or a wireless sensor network, with the aim of designing algorithms with provable optimality guarantees, and understanding how close we can get to the known theoretical lower bounds. Our proposed approach is based on finding an optimal or a near-optimal compression tree for a given sensor network: a compression tree is ...
Workload-Optimal Histograms on Streams
Histograms are used in many ways in conventional databases and in data stream processing for summarizing massive data distributions. Previous work on constructing histograms on data streams with provable guarantees has not taken into account the workload characteristics of databases, which show some parts of the distributions to be more frequently used than others; on the other hand, previo...